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Abstract 

!- _ 

Qh) Nonnegative matrix factorization (NMF) has become a very popular technique in machine 

■ learning because it automatically extracts meaningful features through a sparse and part-based 

representation. However, NMF has the drawback of being highly ill-posed, that is, there typically 
exist many different but equivalent factorizations. In this paper, we introduce a completely new way 
to obtaining more well-posed NMF problems whose solutions are sparser. Our technique is based on 
the preprocessing of the nonnegative input data matrix, and relies on the theory of M-matrices and 
the geometric interpretation of NMF. This approach provably leads to optimal and sparse solutions 
under the separability assumption of Donoho and Stodden [12J . and, for rank-three matrices, makes 
the number of exact factorizations finite. We illustrate the effectiveness of our technique on several 
image datasets. 

Keywords, nonnegative matrix factorization, data preprocessing, uniqueness, sparsity, inverse- 
positive matrices. 
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^2 ■ 1 Introduction 

Given an m-by-n nonnegative matrix M > and a factorization rank r, nonnegative matrix factoriza- 
tion (NMF) looks for two nonnegative matrices U and V of dimension m-by-r and r-by-n respectively 
' such that M ~ UV. To assess the quality of an approximation, a popular choice is the Frobenius norm 
of the residual ||M — f7V||jr and NMF can for example be formulated as the following optimization 
problem 

1 min \\M - UV\\ 2 F such that U > and V > 0. (1.1) 

rN 1 (7eM mxr ,yGlR rx ' 1 

b ; 

Assuming that M is a matrix where each column represents an element of a dataset (e.g., a vectorized 
image of pixel intensities), NMF can be interpreted in the following way. Since M-j ~ zCfc=i ^-.kVkj 
Vj, each column M-j of M is reconstructed using an additive linear combination of nonnegative basis 
elements (the columns of U). These basis elements can be interpreted in the same way as the columns 
of M (e.g., as images). Moreover, they can only be summed up (since V is nonnegative) in order 
to approximate the original data matrix M which leads to a part-based representation: NMF will 
automatically extract localized and meaningful features from the dataset. The most famous illustration 
of such a decomposition is when the columns of M represent facial images for which NMF is able to 
extract common features such as eyes, nodes and lips [25J; see Figure [8] in Section [6l 

NMF has become a very popular data analysis technique and has been successfully used in many 
different areas, e.g., hyperspectral imaging [28], text mining [38], clustering [9], air emission control 
[27] . blind source separation [7J, and music analysis [13] . 
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1.1 Geometric Interpretation of NMF 

A very useful tool for understanding NMF better is its geometric interpretation. In fact, NMF is 
closely related to a problem in computational geometry consisting in finding a polytope nested be- 
tween two given polytopes. In this section, we briefly recall this connection, which will be extensively 
used throughout the paper. 

Let (U, V) be an exact NMF of M (i.e., M = UV, U > and V > 0), and let us assume that no 
column of U or M is all zeros; otherwise they can be removed without loss of generality. 

Definition 1 (Pullback map). Given an m-by-n nonnegative matrix X without all-zero column, D(X) 
is the n-by-n diagonal matrix whose diagonal elements are the inverse of the l\-norms of the columns 
ofX: 




D(X) li = \\X. A \\^ = \ \X ki \\ Vi, D(X) ij = QVi^j, (1.2) 



and 9{X) = XD(X) is the pullback map of X so that 9{X) is column stochastic, i.e., 9(X) is non- 
negative and its columns sum to one. 

We have that (see, e.g., [6]) 

M = UV 9(M) = MD(M) = UD(U)D(Uy 1 VD(M) 9{M) = 9{U)V , 

8(U) V 

where V' must be column stochastic since 6(M) and 0(U) are both column stochastic and 9{M) = 
9(U)V. Therefore, the columns of 9{M) are convex combinations (i.e., linear combinations with 
nonnegative weights summing to one) of the columns of 6(U). This implies that 

conv(6>(M)) C eonv(0(f/)) C A m , (1.3) 

where conv(X) denotes the convex hull of the columns of matrix X, and A m = {x £ M m \ x i = 
1, Xi > 1 < i < m} is the unit simplex (of dimension m — 1). An exact NMF M = UV can then be 
geometrically interpreted as a polytope T = conv(0(U)) nested between an inner polytope conv(#(M)) 
and an outer polytope A m . 

Hence finding the minimal number of nonnegative rank-one factors to reconstruct M ex- 
actly is equivalent to finding a polytope T with minimum number of vertices nested between 
two given polytopes: the inner polytope conv(#(M)) and the outer polytope A m . 

This problem is referred to as the nested polytopes problem (NPP), and is then equivalent to computing 
an exact nonnegative matrix factorization |19j (see also [16] and the references therein). In the 
remaining of the paper, we will denote NPP(M) the NPP instance corresponding to M with inner 
polytope conv(#(M)) and outer polytope A m . 

Remark 1. The geometric interpretation can also be equivalently characterized in terms of cones, 
see rUty . for which we have 

cone(M) C cone(C7) C M™, 

where cone(A) = {x\x = Xa,a > 0}. The geometric interpretation based on convex hulls from 
Equation (|1.3j) amounts to the intersection of the cones with the hyperplane {x\ ^ Xj = 1} (this is the 
reason why zero columns of M and U need to be discarded in that case). 
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1.2 Uniqueness of NMF 

There are several difficulties in using NMF in practice. In particular, the optimization problem 
(jl.ip is NP-hard [36], and typically only convergence to stationary points is guaranteed by standard 
algorithms. There does not seem to be an easy way to go around this (except if the factorization rank 
is very small [2]) since NMF problems typically have many local minima. 

Another difficulty is the non-uniqueness: even if one is given an optimal (or good) NMF (U, V) 
of M, there might exist many equivalent solutions {UQ^Q^V) for non-monomiaU matrices Q with 
UQ > and Q~ l V > 0, see, e.g., [24J. Such transformations lead to different interpretations, especially 
when the supports of U and V change. For example, in document classification, each entry Mjj of 
matrix M indicates the 'importance' of word i in document j (e.g., the number of appearances of word 
i in text j). The factors (U,V) of NMF are interpreted as follows: the columns of U represent the 
topics (i.e., bags of words) while the columns of V link the documents to these topics. The sparsity 
patterns of U and V are then a crucial characteristic since they indicate which words belong to which 
topics and which topics is discussed by which documents. 

Different approaches exist to obtain (more) well-posed NMF problems and most of them are based 
on the incorporation of additional constraints into the NMF model, e.g., 

o Sparsity. Require the factors in NMF to be sparse. Under some appropriate assumptions, this 
leads to a unique solution [M] . Geometrically, requiring the matrix U to be sparse is equivalent 
to requiring the vertices of the nested polytope conv(0(f7)) to be located on the low-dimensional 
faces of the outer polytope A" 1 , hence making the problem more well posed. In practice, the 
most popular technique to obtain sparser solutions is to add sparsity inducing penalty terms, 
such as a ^i-norm penalty [22] (see also Section [6]). Another possibility is to use a projection 
onto the set of sparse matrices [20] . 




o Minimum Volume. Require the polytope conv(0(U)) to have minimum volume, see, e.g., 
|26} |2~H 139] : this has a long history in hyperspectral imaging [8]. Again, this constraint is 
typically enforced using a proper penalty term in the objective function. Volume maximization 
of the polytope is also possible, leading to a sparser factor U (since the columns of U will be 
encouraged to be on the faces of A" 1 ), see, e.g., [37], which is essentially equivalent to performing 
volume minimization for the matrix transpose. In fact, taking the polar of the three polytopes 
in Equation (jl.3|) interchanges the role of the inner and outer polytopes, while the polar of 
conv(0(M)) is given by conv(#(M T )), see, e.g., [HI Section 3.6]. 

Orthogonality. Require the columns of matrix U to be orthogonal [TU]. Geometrically, it 
amounts to position the vertices of conv(#(t/)) on the low-dimensional faces of A m so that if one 
of the columns of 6(U) is not on a facet of A m (i.e., > for some i, k), then all the other 
columns of U must be on that facet (i.e., Ui P = \/p ^ k). This condition is rather restrictive, 
but proved successful in some situations, e.g., for clustering, see [9j [30] . 

1.3 Outline of the Paper 

In this paper, we address the problem of uniqueness and introduce a completely new approach to 
make NMF problems more well posed, and obtain sparser solutions. Our technique is based on a 
preprocessing of the input matrix M to make it sparser while preserving its nonnegativity and its 
column space. The motivation is based on the geometric interpretation of NMF which shows that 
sparser matrices will correspond to more well-posed NMF problems whose solutions are sparser. 

In Section [21 we recall how sparsity of M makes the corresponding NMF problem more well posed. 
In particular, we give a new result linking the support of M and the uniqueness of the corresponding 
NMF problem. 

1 A monomial matrix is a permutation of a diagonal matrix with positive diagonal elements. 
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In Section [3l we introduce a preprocessing V{M) = MQ of M where Q is an inverse-positive 
matrix, i.e., Q has full rank and its inverse Q _1 is nonnegative. Hence, if (U,V) is an NMF of 
V{M) with V(M) » UV', then (C^Q" 1 ) is an NMF of M since M = V(M)Q- 1 ss [/V'Q- 1 and 

y'Q- 1 > 0. 

In Section [H we prove some important properties of the preprocessing; in particular that it is 
well-defined, invariant to permutation and scaling, and optimal under the separability assumption of 
Donoho and Stodden |12| . Moreover, in the exact case for rank-three matrices (i.e., M = UV and 
rank(Af) = 3) we show how the preprocessing can be used to obtain an equivalent NMF problem with 
a finite number of solutions. 

In Section O we address some practical issues of using the preprocessing: the computational cost, 
the rescaling of the columns V(M) and the ability to dealing with sparse and noisy matrices. 

In Section [6l we present some very promising numerical experiments on facial and hyperspectral 
image datasets. 

2 Non-Uniqueness, Geometry and Sparsity 

Let M G M™ xn and (U, V) G M+ Xr x W* n be an exact nonnegative matrix factorization of M, i.e., 
M = UV. The minimum r such that such a decomposition exists is the nonnegative rank of M and 
will be denoted rank + (M). If U is not full rank (i.e., rank(C/) < r), then the decomposition is typically 
not unique. In fact, the convex combinations (i.e., V > 0) cannot in general be uniquely determined: 
the polytope T = conv(0(f/)) has r vertices while its dimension is strictly smaller than r — 1 implying 
that any point in the interior of T can be reconstructed with infinitely many convex combinations of 
the r vertices of T. However, if all columns of conv(#(M)) are located on /c-dimensional faces of T 
having exactly k + 1 vertices, then the convex combinations given by V are unique |32j . 

In practice, it is therefore often implicitly assumed that rank + (M) = rank(M) = r hence rank({7) = 
r (since U has r columns and spans the column space of M of dimension r); see the discussion in [2] 
and the references therein. In this situation, the uniqueness can be characterized as follows: 

Theorem 1 ([21]). Let (U, V) £ R™ xr x R r + Xn and M = UV with rank(M) = rank(?7) = r. Then the 
following statments are equivalent: 

(i) The exact NMF (U, V) of M is unique (up to permutation and scaling). 

(ii) There does not exist a non-monomial invertible matrix Q such that U' = UQ > and V' = 
Q- X V > 0. 

(Hi) The polytope conv(9(U)) is the unique solution of NPP(M) with r vertices. 

It is interesting to notice that the columns of M containing zero entries are located on the boundary 
of the outer polytope A m , and these points must be on the boundary of any solution T of NPP(M). 
Therefore, if M contains many zero entries, it is more likely that the set of exact NMF of M will be 
smaller, since there is less degree of freedom to fill in the space between the inner and outer polytopes. 
In particular, Donoho and Stodden [T2] showed that "requiring that some of the data are spread across 
the faces of the nonnegative orthant, there is unique simplicial cone" , i.e., there is a unique conv(#(C/)). 

In the following, based on the assumption that rank(M) = rank + (M), we provide a new uniqueness 
result using the geometric interpretation of NMF and the sparsity pattern of M. 

Lemma 1. Let M G ]R mxra with r = rank(M) = rank + (M), and M have no all-zero columns. If r 
columns of 6{M) coincide with r different vertices of A m n col(#(M)) ; then the exact NMF of M is 
unique. 
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Proof. Let (U, V) G M™ xr x R^ xn be such that M = UV . Since r = rank(M) = rank+(M), we must 
have rank({7) = r and col({7) = col(M) (where co\(X) denotes the column space of matrix X), hence 

conv(0(M)) C conv(0(£7)) C A m n col(0(M)). 

Since r columns of 9(M) coincide with r vertices of A m n col(#(M)), we have that conv(0(£/)) = 
conv(#(M)) is the unique solution of NPP(M), and Theorem [1] allows to conclude. □ 

In order to identify such matrices, it would be nice to characterize the vertices of A m n eol(0(M)) 
based solely on the sparsity pattern of M. By definition, the vertices of A m n col(#(M)) are the 
intersection of r — 1 of its facets, and the facets of A m n col(0(M)) are given by 

Fi = {x G A m n col(0(Af)) | Xi = 0}. 

Therefore, a vertex of A m n col(0(M)) must contain at least r — 1 zero entries. However, this is not 
a sufficient condition because some facets might be redundant, e.g., if the ith row of M is identically 
equal to zero (for which Fi = A m n col(0(M))) or if the ith and jth row of M are equal to each other 
(for which F{ = Fj). 

Lemma 2. A column of M containing r — 1 zeros whose corresponding rows have different sparsity 
patterns corresponds a vertex of conv(9(M)) n A m . 

Proof Let c be one of the columns of M with at least r — 1 zeros corresponding to rows with different 
sparsity patterns (i.e., different supports), i.e., there exists I C {i \ Cj = 0} with 1 1\ = r — 1 such that 
the rows of M(I, :) have different sparsity patterns. Let also F^ = {x \ Xjn^ = 0} for 1 < k < r — 1 
denote the r — 1 facets with 9(c) £ Fk Vfc. To show that 9(c) is a vertex of conv(#(M)) n A m , it 
suffices to show that the r — 1 facets are not redundant, i.e., for alll<A;<p<r — 1, there exist xj~ 
and x p in conv(#(M)) n A m such that Xk G Fk,Xk $ F p and x p G F p ,x p £ Fk- Because the rows of 
M(I, :) have different sparsity patterns, for all 1 < k < p < r — 1, there must exist two indices h and 
I such that M(I(k),h) = and M(I(p),h) > while M(I(k),l) > and M(I(p),l) = 0. Therefore, 
9(M. h ) G F k , 9(M. h ) F p and 9(M d ) £ F p , 9(M d ) <£ F k and the proof is complete. □ 



Theorem 2. Let M G R mxn with r = rank(M) = 
having r — 1 zero entries whose corresponding rows 
M is unique. 

Proof This follows directly from Lemma [1] and [2j 



= rank_|_(M). If M has r non-zero columns each 
have different sparsity patterns, then the NMF of 

□ 



Here is an example, 

/0 1 1\ 
1 
M = 1 ' 
V i i o / 

with rank(M) = rank + (M) = 3 whose unique NMF is M = MI. Other examples include matrices 
containing an r-by-r monomial submatrix. It is also interesting to notice that this result implies that 
the only 3-by-3 rank-three nonnegative matrices having a unique exact NMF are the monomial ma- 
trices (i.e., permutation and scaling of the identity matrix) since all other matrices have at least two 
distinct exact NMF: M = MI = IM. 

Finally, the geometric interpretation of NMF shows that sparser matrices M lead to more well- 
posed NMF problems because many points of the inner polytope in NPP(M) are located on the 
boundary of the outer polytope. Moreover, because the solution T must contain these points, it will 
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have zero entries as well. In particular, assuming M does not contain a zero column, it is easy to 
check that for M = UV we have 

Mij =0 => 3k such that 11^ = 0. 

Remark 2. Having many zero entries in M is not a necessary condition for having an unique NMF. 
In fact, Laurberg et al. \2J$ showed that there exist positive matrices with unique NMF. However, for 
an NMF (U,V) to be unique, the support of each columns of U (resp. row ofV) cannot be contained 
in the support of any another column (resp. row) so that each column of U (resp. row of V) must 
have at least one zero entry. In fact, assume the support of the kth column of U is contained in the 
support of 1th column. Then noting p = argminjpi^^^^gj jj^f^, e = , and 

Dm = — e, Da = 1 Vi, T)%j = otherwise, 

one can check that D -1 is as follows 

D k l = e, D- 1 = 1 Vi, D^ 1 = otherwise, 

that is D- 1 > 0. Therefore (UD^^V) is an equivalent NMF with a different sparsity pattern since 
(UD) a = UD d = U a - eU :k > 0, and U pl > while (UD) pl = 0. 

3 Preprocessing for More Well-Posed and Sparser NMF 

In this section, we introduce a completely new approach to obtain more well-posed NMF problems 
whose solutions are sparser. As it was shown in the previous paragraph, this can be achieved by working 
with sparser nonnegative matrices. Hence, we look for an n-by-n matrix Q such that MQ = M' is 
nonnegative, sparse and Q is inverse-positive. In other words, we would like to solve the following 
problem: 

min llMQUo such that MQ > and Q' 1 > 0, (3.1) 

QeM nx ™ 

where ||-X"||o is the £o-' norm ' which counts the number of non-zero entries in X. Assuming we can 
solve (J3TTD and obtain a matrix M' = MQ, then any NMF (17, V) of M' with M' w UV gives a NMF 
for M. In fact, 

M = M'Q- 1 rj UV'Q- 1 = UV, where V = V'Q' 1 > 0, 

for which we have 

\\M - UV\\ F = \\M'Q- 1 - UV'Q' l \\ F = \\(M' - UV')Q- X \\ F < \\M' - UV'\\ F ||Q _1 || 2 . 

In particular, if the NMF of M' is exact, then we also have an exact NMF for M = M'Q" 1 = 
UV'Q -1 = UV. The converse direction, however, is not always true. We return to this point in 
Section 14.31 

In the remaining of this section, we propose a way to finding approximate solutions to problem 
(|3.ip . First, we briefly review some properties of inverse-positive matrices (Section 13. ip in order to 
deal with the constraint Q^ 1 > 0. Then, we replace the ^o - ' 110 ™ 1 ' with the ^-norm and solve the 
corresponding optimization problem using constrained linear least squares (Section I3.2p . 

3.1 Inverse-Positive Matrices 

In this section, we recall the definition of three types of matrices: Z-matrices, M-matrices and inverse- 
positive matrices, briefly recall how they are related and provide some useful properties. We refer the 
reader to the book of Berman and Plemmons [3] and the references therein for more details on the 
subject. 
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Definition 2. An n-by-n Z-matrix is a real matrix with non-positive off-diagonal entries. 

Definition 3. An n-by-n M-matrix is a real matrix of the following form: 

A = sl-B, s>0, B>0, 

where the spectral radiu^ p(B) of B satisfies s > p{B). 

It is easy to see that an M-matrix is also a Z-matrix. 

Definition 4. An n-by-n matrix Q is inverse positive if and only ifQ~ l exists and Q _1 is nonnegative. 
We will note this set XV 11 : 

XV n = {Q G R nxn | Q is full rank and Q" 1 > 0}. 

It can be shown that inverse-positive Z-matrices are M-matrices: 

Theorem 3 ([3j Theorem 2.3]). Let A be a Z-matrix. Then the following conditions are equivalent : 

o A is an invertible M-matrix. 

o A = si- B with B > 0, s> p(B). 

o A Si?", i.e., A is inverse positive. 

Here is another well-known theorem in matrix theory which will be useful, see, e.g., \i'6\ [TT] . 

Definition 5. An n-by-n matrix A is irreducible if and only if there does not exist an n-by-n permu- 
tation matrix P such that 

pTAP =(l ° D )- 

where B and D are square matrices. 

Definition 6. An n-by-n matrix A is irreducibly diagonally dominant if A is irreducible, 

\Au\ > ^2 \Aki\, f° r i = 1) 2, . . . , n, (Diagonal Dominance) 

and the inequality is strict for at least one i. 

Theorem 4. If A is irreducibly diagonally dominant, then A is nonsingular. 
3.2 Constrained Linear Least Squares Formulation for ( 13.1 j) 

The ^o-' norm ' is of combinatorial nature and typically leads to intractable optimization problems. The 
standard approach is to use the ^i-norm instead but we propose here to use the ^2-norm. The reason 
is twofold: 

o When looking at the structure of problem (|3.ip . we observe that any (reasonable) norm will 
induce solutions with zero entries. In fact, some of the constraints MQ > will always be active 
at optimality because of the objective function ||MQ||. 

o The ^2" n orm is smooth hence its optimization can be performed more efhcienthjf]. 

2 The spectral radius p(B) of a n-by-n matrix B is the supremum among all the absolute values of the eigenvalues of 
B, i.e., p(B) = max, \\i(B)\. 

3 Because of the constraint MQ > 0, the ^i-norm problem can actually be decoupled into n linear programs (LP) in 
n variables and m + n constraints, and can be solved effectively. However, in the noisy case (cf. Section [S3J, we would 
need to introduce inn auxiliary variables (one for each term of the objective function) which turns out to be impractical. 
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We then would like to solve 



mm n \\MQ\\p such that MQ > 0. (3.2) 

Optimizing over the set of inverse-positive matrices XV n seems to be very difficult. At least, describing 
XV n explicitly as a semi-algebraic set requires about n 2 polynomial inequalities of degree up to n, 
each with up to n! terms. However, we are not aware of a rigorous analysis of the complexity of this 
type of problems; this is a topic for further research. 

For this reason, we will restrict the search space to the subset of Z-matrices, i.e., inverse-positive 
matrices of the form Q = sI—B, where s is a nonnegative scalar, / is the identity matrix of appropriate 
dimension and B is a nonnegative matrix such that p(B) < s, see Section T3. II It is important to notice 
that 

o The scalar s cannot be chosen arbitrarily. In fact, making s go to zero and B = 0, the objective 
function value goes to zero, which is optimal for (|3. 2[) . The same degree of freedom is in fact 
present in the original problem (|3.ip since Q and aQ for any a > are equivalent solutions. 
Therefore, without loss of generality, we fix s to one . 

o The diagonal entries of B cannot be chosen arbitrarily. In fact, taking B arbitrarily close (but 
smaller) to the identity matrix, the infimum of (|3.2|) will be equal to zero. We then have to 
set an upper bound (smaller than one) for the diagonal entries of B. It can be checked that 
this upper bound will always be attained (because of the minimization), and that the optimal 
solutions corresponding to different upper bounds will be multiples of each other. We therefore 
fix the bound to zero implying Ba = for all i, i.e., Qa = 1 for all i. 

Finally, we would like to solve 

min \\MQ\\ 2 F such that MQ > 0, 

QeQ n ~ 

where 

Q n = {Q e R nxn \Q = I-B,B>0,Bu = 0Vi, p(B) < 1} c TV n . 
Since MQ = M(I — B) > 0, this problem is equivalent to 



min V \\M :i -Y j M :k B ki 

i=l kjtt 

such that M > MB, (3.3) 
p(B) < 1, 
Ba = Vi, B > 0. 

Without the constraint on the spectral radius of B, this is a constrained linear least squares problem 
(CLLS) in 0(n 2 ) variables and 0(n 2 + mn) constraints. The ith. column of M' = MQ, which is the 
preprocessed version of M, will then be given by the following linear combination 

n 

M' A = MQ. A = M. A - ^2 M ± B ki > 0, where B ki >0\/i,k and Ba = 0. (3.4) 

k=l 

This means that we will subtract from each column of M a nonnegative linear combination of the 
other columns of M in order to maximize its sparsity while keeping its nonnegativity. Intuitively, this 
amounts to keeping only the non-redundant information from each column of M (see Section [6] for 
some visual examples). 
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3.2.1 Relaxing the Constraint on the Spectral Radius 



In general, there is no easy way to deal with the non-convex constraint p(B) < 1. In particular, this 
constraint may lead to difficult optimization problems, e.g., finding the nearest stable matrix to an 
unstable one: 

min \\X - A\\ such that p(X) < 1, 

see [29] and the references therein. This means that even the projection of the feasible set is non-trivial. 

However, we will prove in Section H] that if the columns of M are not multiples of each other, then 
any optimal solution of problem (|3.3p without the constraint on the spectral radius of B, i.e., any 
optimal solution B* of 



n 



mm 

n X n 

+ i=l k^i 



^2 \\ M:i ~ M - kBki such that M - MB > Bii = v *' ( 3 - 5 ) 



automatically satisfies p(B*) < 1. Hence, the approach may only fail when there are repetitions in the 
dataset. The reason is that when a column is multiple of another one, say M : j = aM : j for i ^ j and 
a > 0, then taking B^ = a (0 otherwise for that column) gives MQ-i = M-i — aM : j = and similarly 
for M-j. Hence we have lost a component in our dataset and potentially produce a lower rank matrix 
M Q. In practice, it will be important to make sure that the columns of M are not multiples of each 
other (even though it is usually not the case for well-constructed datasets). 



4 Properties of the Preprocessing 

In the remainder of the paper, we denote B*(M) the set of optimal solutions of problem (|3.5p for the 
data matrix M, and V the preprocessing operator defined as 

V : M+ Xn -> M+ Xn : M V(M) = M(I - B*), where B* € B*(M). 

In this section, we prove some important properties of V and B*(M): 

o The preprocessing operator V is well-defined (Theorem [5]). 

o The preprocessing operator V is invariant to permutation and scaling of the columns of M 
(Lemma [3]). 

o If the columns of 9{M) are distinct, then p(B*) < 1 for any B* £ B*(M) (Theorem [6]). 
o If the vertices of conv(#(M)) are distinct then 

- There exists B* £ B*(M) such that p(B*) < 1 (Corollary E}. 

- rank(P(M)) = rank(M) and rank + (-p(M)) > rank + (M) (Corollary Q}. 

o If the matrix M is separable, then the preprocessing allows to recover a sparse and optimal 
solution of the corresponding NMF problem (Theorem [7|) . In particular it is always optimal for 
rank- two matrices (Corollary [3]). 

o If the matrix has rank-three, then the preprocessing yields an instance in which the number of 
solutions of the exact NMF problem is finite (Theorem [9]) . 
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4.1 General Properties 

A crucial property of our preprocessing is that it is well-defined. 

Theorem 5. The preprocessing V(M) is well-defined, i.e., for any B* G B*{M),B^ G B*{M), we 
have M(I - B{) = M(I - B%) = V(M). 

Proof. Problem (|3.5p can be decoupled into n independent CLLS (one for each column of M) of the 
form: 

min \\d — Cb\\ 2 such that Cb < d = min \\d — y\\ 2 such that y < d,y = Cb. (4.1) 

The result follows from the fact that the £2 projection onto a polyhedral set (actually any convex set) 
yields a unique point. □ 

Another important property of the preprocessing is its invariance to permutation and scaling of 
the columns of M. 

Lemma 3. Let M be a nonnegative matrix and P be a monomial matrix. Then, V{MP) = V{M)P. 

Proof. We are going to show something slightly stronger; namely that B* is an optimal solution of 
(|3.5p for matrix M if and only if P~ 1 B*P is an optimal solution of (|3.5p for matrix MP, i.e., 

B* G B*(M) P~ 1 B*P G B*(MP). 

First, note that B is a feasible solution of (|3.5|) for M if and only if P~ 1 BP is a feasible solution of 
(|3.5p for MP. In fact, nonnegativity of B and its diagonal zero entries are clearly preserved under 
permutation and scaling while 

M > MB <=^> MP > MBP <=^> MP > M(PP~ l )BP MP > (MP)(p- 1 BP). 

Hence there is one-to-one correspondence between feasible solutions of (|3.5|) for M and (|3.5p for MP. 

Then, let B* be an optimal solution of (|3.5p . Because (|3.5p can be decoupled into n independent 
CLLS's, one for each column of B (cf. Equation (|4.1|) ). we have 

\\M. A -MB*\\l < \\M.i -MB :i \\l, Vi, 



for any feasible solution B of (|3.5|) . Letting p G be such that pi is equal the non-zero entry of the 
ith row of P, we have 

Y,Pi\\M-.i - MB*\\ 2 2 = Y,W M --iPi ~ MPP- l B* Pi \\ 2 2 
i i 

= \\MP - MPP- 1 B*P\\ 2 F 

< ^2p 2 \\M..i -MB :i \\l = \\MP - MPP~ l BP\\ 2 F , 

i 

for any feasible solution B' = P^BP of for MP. This proves B* G B*(M) p- 1 B*P G 

B*{MP). The other direction follows directly by using the permutation on the matrix MP. □ 

It is interesting to observe that if a column of M belongs to the convex cone generated by the 
other columns, then the corresponding column of V(M) is equal to zero. 

Lemma 4. LetX= {1,2,. . . ,n}\{i}. Then V(M). A = if and only if M :i G cone(M(:, J)). 
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Proof. We have that 



V(M). A = M.. l -Y J B* ki M :k = 0, B* H > M :i = £ i^M,, fl£ > 0. 



□ 



The preprocessed matrix V(M) may contain all-zero columns, for which the function 9(.) is not 
defined (cf. Definition [1]) . We extend the definition to matrices with zero columns as follows: 9{X) is 
the matrix whose columns are the normalized non-zero columns of X, i.e., letting Y be the matrix X 
where the non-zero columns have been removed, we define 0(X) = 9(Y). Hence conv(9(X)) denotes 
the convex hull of the normalized non-zero columns of X. 

Another straightforward property is that the preprocessing can only inflate the convex hull defined 
by the columns of 9{M). 

Lemma 5. Let M E M™ xri . If the vertices of conv(6>(M)) are non-repeated, then 
conv(0(M)) C conv(0(P(M))) C A m n col(0(Af)). 

Proof By construction, since V{M) = MQ, col(0(P(M))) C col(0(M)) and conv(9(V(M))) C A m n 
col(0(M)). Let i be the index corresponding to a vertex of 9(M) and I = {1, 2, . . . , n}\{i}. Because 
vertices of 9{M) are non-repeated, we have M-i ^ conv(#(M(:,Z))), while 

V{M) A = M. a ~Y^ hiM :k ^ M. A = V(M). A + ^2 b ki M :k . 

k^i kj^i 

Hence M :i G conv(9([T>(M). A M (:,!)])), which implies that 

conv(^(M)) C conv(0([7 : '(M) : iM(:,X)])), 

so that replacing M A by V{M\ A extends conv(#(M)). Since this holds for all vertices, the proof is 
complete. □ 

Corollary 1. Let M E MJ? xn . If no column of M is multiple of another column, then 

rank(P(M)) = rank(M) and rank + (P(M)) > rank + (M). 

Proof. Without loss of generality, we can assume that M does not have a zero column. In fact, 
a preprocessed zero column remains zero while it cannot influence the preprocessing of the other 
columns (see Equation (|3.4p ). Then, by Lemma [5j we have 

conv(6>(M)) C conv(0(P(M))) C A m n col (6»(M)), 

implying iank + (V(M)) > rank + (M) and rank(7 7 (M)) = rank(M). 

Another way to prove this result is to use Corollary [2] (see below) guaranteeing the existence 
of an inverse-positive matrix Q such that V(M) = MQ which implies rank("P(M)) = rank(M). 
Moreover, any exact NMF (U,V) E M. mxr x R rxn of T(M) gives M = UVQ~ l hence rank+(M) < 
iank + (V(M)). □ 

We now prove that if no column of M is multiple of another column (i.e., the columns of 9{M) 
are distinct) then p{B*) < 1 for any B* E B*(M) whence Q = I — B* is an inverse positive matrix. 



11 



Lemma 6. Let A be a column stochastic matrix and Q = I — B where B > and Ba = for all i be 
such that AQ > 0. Then, 

^B ki <l, \/i, 

k 

i.e., Q is diagonally dominant. Moreover, if A-% ^ conv(^4(:,X)) where I = {1,2, ... ,n}\{i} , then 



< i. 



Proof. By assumption, we have for all i 

A :i > AB :i = J2 A -kB k i 1 = HAiHl > || AB.il |i = \\J2 A -k B ki\\l 



\B. 



■i\ 1 



because A and B are nonnegative. Moreover, if A-i ^ conv(^4(:,Z)), then there exists at least one 
index j such that Aji > Aj._B._i (Lemma EJ) so that the above inequality is strict. □ 



Theorem 6. If no column of M is multiple of another column, then any optimal solution B* of (|3.5p 
satisfies p{B*) < 1, i.e., Q = I — B* is inverse positive. 

Proof. By Theorem O p(B*) < 1 if and only if Q = I — B* is inverse positive if and only if Q is a 
nonsingular M- matrix. Let us then show that Q is a nonsingular M- matrix. First, we can assume 
without loss of generality that 

o Matrix M does not contain a column equal to zero. In fact, if M does, say the first column is 
equal to zero, then we must have B-\ = (since M-_\ > MB_\ and there is not other zero column 
in M). The matrix Q is then a nonsingular M-matrix if and only if Q(2:n, 2:n) is. 



o The columns of M sum to one. In fact, letting P = D(M) be defined as in Equation (|1.2|) . by 
Lemma [3l B* is an optimal solution for M if and only if P~ 1 B*P is an optimal solution for 
MP. Since B* and P~ 1 B*P share the same eigenvalues, p(B*) < 1 <^=^ p(P 1 B*P) < 1. 



o Let B S B*(M), Q = I — B* , and P be a permutation matrix such that 



P T QP 



( Q0) Q(12) Q(13) 

Q( 2 ) 

QP) 

v o ... 



... \ 

... Q( 2fe ) 
. . . Q( 3fc ) 

J 



I B (l) £(12) £(13) 

B& bW 

o o s( 3 ) 

v o ... 



... B^ \ 

B^ 
B&V 

B^ J 



where QW are irreducible for all i. Without loss of generality, by Lemma [31 we can then assume 
that Q has this form. 

In the following we show that is nonsingular for each 1 < p < k hence Q is. By Theorem [U if 
q(p) [ s irreducibly diagonally dominant, then is nonsingular and the proof is complete. We already 
have that is irreducible for 1 < p < k. Let I p denote the index set such that = Q(I p ,I p ). 
We have M(I p , :) is column stochastic, and 

p-i 

V(M)(I P , :) = M(I p , M(Ii, :)B^ - M(I P , :)B® > 0, 

l=i 

implying that M(I p ,:) > M(I p ,:)B^ p \ Moreover the columns of M(I p ,:) are distinct so that there 
is at least one which does not belong to the convex hull of the others. Hence, by Lemma [UJ is 
irreducibly diagonally dominant. □ 
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Corollary 2. Let M 6 M™ xn . If the vertices of conv (8 (M)) are non-repeated, then there exists one 
optimal solution B* £ B*(M) such that p{B*) < 1, i.e., such that Q = I — B* is an inverse-positive 
matrix. 

Proof. Let us show that Q is a nonsingular M-matrix. First, by Lemma O Q is diagonally dominant 
implying p(B) < 1 so that Q is an M-matrix (cf. Theorem[6]). We can assume without loss of generality 
that the r first columns of M correspond to the vertices of conv(#(M)). This implies that there exists 
an optimal solution B* G B*(M) with the following form 

( Qi Ql2 \ = T _ ( B* x Bl, 

V o / J V 

In fact, by assumption, the last columns of M belong to the convex cone of the r first ones and can 
then be set to zero (which is optimal) using only the first r columns (cf. Lemma 2]). Lemma [6] applies 
on matrix Q\ and M(:, l:r) since 

MQ(:, l:r) = M(:, l:r) - Af (:, l:r)B{ > 0, 

while by assumption no column of M(:,l:r) belong to the convex hull of the other columns, so that 
Qi is strictly diagonally dominant hence is a nonsingular M-matrix. □ 

Finally, what really matters is that the vertices of conv(#(M)) are non-repeated. In that case, 
the preprocessing is unique and the preprocessed matrix has the same rank as the original one. The 
fact that Q could be singular is not too dramatic. In fact, given an NMF (U, V) of the preprocessed 
matrix V{M) = MQ ~ UV', we can obtain the optimal factor V for matrix M by solving the nonneg- 
ative least squares problem V = argmin x>0 \ \M — UX\\p (instead of taking V = V'Q^ 1 ) and obtain 
M ps UV. 



4.2 Recovery under Separability 

Definition 7 (Separability). A nonnegative factorization M = UV is called separable if for each i 
there is some column f(i) of V that has a single nonzero entry and this entry is in the ith row, i.e., 
V contains a monomial submatrix. In other words, each column ofU appears (up to a scaling factor) 
as a column of M. 

Note that this assumption is equivalent to the pure pixel assumption in hyperspectral imaging (i.e., 
for each constitutive material present in the image, there is at least one pixel containing only that 
material) [8] or, in document classification (see Section \T7Z\i . to the assumption that, for each topic, 
there is at least one document corresponding only to that topic (or, considering the matrix transpose, 
that there is at least one word corresponding only to that topic 12]). 

Geometrically, this means that the vertices of conv(#(M)) are given by the columns of 9{U). We 
have the following straightforward lemma: 

Lemma 7. M = UV is separable if and only if conv (9 (M)) = conv(0 ([/)). 

Proof. M = UV is separable if and only if V contains a monomial submatrix if and only if the vertices 
of 9{U) and 9{M) coincide if and only if conv(0(M)) = conv(0(C/)). □ 

Theorem 7. If M is separable and the r vertices of0(M) are non-repeated, then V(M) has r non-zero 
columns, say S : \, S : 2, ■ ■ ■ S :r , such that conv(#(M)) C conv (8 (S)), i.e., there exists R>0 such that 
M = SR. 

Proof. This is a consequence of Lemmas H] and [71 □ 
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Theorem [7] shows that the preprocessing is able to identify the r columns of M = UV corresponding 
to the vertices of 9{M). Moreover, it returns a sparser matrix S, namely V(U), whose cone contains 
the columns of M. Remark also that Theorem [7] does not require M to be full rank, i.e., the dimension 
of conv(#(M)) can be smaller than r — 1. 

Corollary 3. For any rank-two nonnegative matrix M whose columns are not multiples of each other, 
V{M) has only two non-zero columns, say S-\ and S-2 such that conv(#(M)) C conv(0(S)), i.e., there 
exists R > such that M = SR. In other words, the preprocessing technique is optimal as it is able 
to identify an optimal nonnegative basis for the NMF problem corresponding to the matrix M. 

Proof. A rank-two nonnegative matrix is always separable. In fact, a two-dimensional pointed cone 
is always spanned by two extreme vectors. In particular rank(M) = 2 <^=^ rank+(M) = 2, see, 
e.g., [35]. □ 

Example 1. Here is an example with a rank-three separable matrix 



5 555914177 
V-[ 10 653784158 
8 994783967 



1 2 3 6 4 4 
01057774 
1 9 4 4 8 6 



(4.2) 



Rs preprocessed version is 



V{M) 




5.93 
0.72 
0.93 




Figure [7] shows the geometric interpretation of the preprocessing. Notice that the preprocessing makes 
the solution to the corresponding NMF problem unique. 




Figure 1: Geometric interpretation of the preprocessing of matrix M from Equation (|4.2 
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4.3 Uniqueness and Robustness Through Preprocessing 

A potential drawback of the preprocessing is that it might increase the nonnegative rank of M. In 
this section, we show how to modulate the preprocessing to prevent this behavior. 

Let us define 

V a (M) = M(I - aB*) = M — aMB*, 

where < a < 1 and B* G B*(M). Notice that V a {M) is well-defined because for any B*, B\ G B*(M) 
we have MB\ = MB%; see Theorem EJ 

Lemma 8. Let M be a nonnegative matrix such that the vertices of eonv(#(M)) are non-repeated. 
Then, for any < a < /3 < X, 

conv(0(M)) C conv(0(P Q (M))) C conv(0(7^(M))) C col(8(M)) n A m . 

Therefore, 

rank + (M) < rank + (V a (M)) < rank + (-p /3 (M)). 

Proof. The proof can be obtained by following exactly the same steps as the proof of Lemma [5j □ 

Lemma 9. Let M be a nonnegative matrix such that the vertices of conv(#(M)) are non-repeated, 
then the supremum 

a = sup a such that iwak + (V a (M)) = rank + (M) (4.3) 

0<o<l 

is attained. 

Proof. We can assume without loss of generality that M does not have all- zero columns. In fact, if 
M : j = for some i then V a {M\^ = for all a G [0, 1] so that the nonnegative rank of V a (M) is not 
affected by the zero columns of M. 

Then, if a = 1, the proof is complete. Otherwise, one can easily check that, for any < a < 1, we 
have V a (M) : i ^ \/i (using the same argument as in Lemma H]). 

Finally, the result follows from the upper-semicontinuity of the nonnegative rank [_4J Theorem 3.1]: 
'If P is nonnegative matrix, without zero columns and with rank + (P) = k, then there exists a ball 
B(P,e) centered at P and of radius e > such that rank + (iV) > k for all iV G B(P,e)\ Therefore, if 
the supremum of (|4.3p was not attained, the matrix 'Pq(M) would satisfy rank + ('Pa(M)) > rank + (M) 
while for any a < a we would have rank + (V a (M)) = rank + (M), a contradiction. □ 

Hence working with matrix V a {M) instead of M will reduce the number of solutions of the NMF 
problem while preserving the nonnegative rank: 

Theorem 8. Let M be a nonnegative matrix for which the vertices of conv (9 (M)) are non-repeated, 
let also a be defined as in Equation f|4.3[) . Then any NMF (U, V) ofP a (M) corresponds to an NMF 
(U, VQ' 1 ) of M, while the converse is not true. In fact, 

conv(0(M)) C conv(6»CP a (M))). 

Therefore, the NMF problem for V a (M) is more well posed. 

Proof. This follows directly from the definition of a, and Lemmas [8] and □ 
We now illustrate Corollary [8] on a simple example, which will lead to three other important results. 
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Example 2 (Nested Squares). Let 



M 



/ 5 3 3 5 \ 

3 5 5 3 

5 5 3 3 

\ 3 3 5 5 ) 



The •problem NPP(M) restricted to the column space of M is made up of two nested squares, conv(#(M)) 
and col(0(M)) n A m , centered at (0,0) with side length 2 and 8 respectively, see Figure® The polygon 
corresponding to V a (M) is a square centered at (0,0) with side length depending on a, between 2 
(for a = 0) and 8 (for a = 1). We can show that the largest such square still included in a triangle 
corresponds to 



( 5 3 3 5 \ 

3 5 5 3 

5 5 3 3 

\ 3 3 5 5 J 



I 1 + a 1 — a 1 — a 1 + a \ 

1 — a 1 + a 1 + a 1 — a 

1 + a 1 + a 1 — a 1 — a 

\ 1 — a 1 — a 1 + a 1 + a j 



(4.4) 



where a = \/2 — 1 and a = g~ (this follows from the proof of Theorem^- see below). Hence, the 
polygon conv(9(T ,a (M))) is a square centered at (0,0) with side length 8a in between conv(#(M)) and 
col(#(M)) n A m , see Figure® Unfortunately, the NMF ofV a (M) is non-unique. In fact, we will see 



col(9(M)) n A 




-5 -4 -3 -2 -1 1 2 3 4 5 

Figure 2: Geometric interpretation of the preprocessing of matrix M from Equation (|4.4p . 



later that it has 8 solutions (the ones drawn on Figure® & n d their rotations). 
Example [2] illustrates the following three important facts: 



Fact 1. Defining a well-posed NMF problem is not always possible. In other words, there 
does not exist any 'reasonable' NMF formulation having always a unique solution (up to permutation 
and scaling). In fact, Example [2] shows that, because of the symmetry of the problem, any solution 
of NPP(M) can be rotated by 90, 180 or 270 degrees to obtain a different solution with exactly 
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the same characteristics (the rotated solutions cannot be distinguished in any reasonable way). For 
example, there are 4 solutions which are the sparsest, each containing one vertex of col(#(M)) n A m , 
see conv (0(t/2)) on Figure [21 including 



U 2 



( 1 a \ 

1 - a 1 

a 1 

\ 1 - a I ) 



and its rotation of 180 degrees U!} 



(180) 



/ 1- a 1 \ 

1 a 

1 - a 1 1 

\ a 0/ 



Fact 2. The preprocessing makes NMF more robust. For any m-by-n matrix E such that 
col(£) C col(M), M + E > 0, and 

conv(6»(M)) C conv((9(M + E)) C conv(P a (M)), 

the exact NMF (£/, V) of V a (M) will still provide an optimal factor [/ for the perturbed matrix M+E. 
In particular, if the matrix M is positive, then one can show tha1@ conv(#(M)) is strictly contained in 
conv('P a (M)) (given that a > 0) so that any sufficiently small perturbation E with col(E) C col(M) 
will satisfy the conditions above. 

In Example [21 the vertices of M can be perturbed and, as long as they remain inside the square 
defined by conv("P a (M)) (see Figure [2]), the exact NMF of conv("P a (M)) will provide an exact 
NMF for the perturbed matrix M. (More precisely, any matrix E such that col(E) C col(M) and 
max,- a \Eij\ < V2 - 1 will satisfy conv((9(M + Ej) C conv(7 :,a (M)).) 

Fact 3. The preprocessing reduces the number of solutions of the NMF problem. In 

Example [21 even though the NMF of "P Q (M) is non-unique, the set of solutions has been drastically 
reduced: from a two-dimensional space to a zero-dimensional one containing eight points (conv(0(Lq)), 
conv(0(£/2)) and the corresponding rotated solutions, see Figure [2]) • 

Theorem 9. Let M G M™ xn be such that rank(M) = rank + (M) = 3 and let a be defined as in 
Equation (|4.3|) , Assume also that conv(0("P(M))) has at least four vertices. Then the number of 
solutions of NPP(P a {M)) with three vertices is smaller than m + n. 

Proof. Let P and Q denote the outer and inner polygons of NPP("P a (M)), respectively. Let us also 
parametrize the boundary of the outer polygon P with the parameter t G [0, 1] and the function 



where x is a continuous function with x{0) = x(l) and {x(t) \ t G [0, 1]} is equal to the boundary of 
P. We also define the function x for values of t larger than one using x(t) = x(t— [t\ ) where [t\ is the 
largest integer not exceeding t. Using the construction of Aggarwal et al. [1], we define the function 



as follows. Let t\ G [0, 1) and x(ti) be the corresponding point on the boundary of P. From x(ti), we 
can trace the tangent to Q (i.e., Q is on one side of the tangent, and the tangent touches Q), say in 
the clock- wise direction, intersect it with P and hence obtain a new point x(t%) on the boundary P 
(see Figure [3] for an illustration on the nested squares problem) . We assume without loss of generality 
that ti > t\ (i.e., if t 2 happens to be larger than one, we do not round it down with the equivalent 
value t 2 — L*2 J ) - Starting from x(t-i), we can use the same procedure to obtain x(t^) and we apply 
this procedure k times to obtain the point x(tk+i), where t^+i > ••• > *2 > H- Finally, we define 
fkih) = tk+i- 

4 Using the same ideas as in Lemma[5]and the fact that any preprocessed column must contain at least one zero entry. 



x : R + -)• M 2 : t (->• x(t) G P, 



f k :R + ^R + :t^ f k (t) 
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Figure 3: Mapping of the point x{t\) to x(ti) using the construction from [1J. 



Aggarwal et al. showed that x(t\) can be taken as a vertex of a feasible solution of NPP (T >a {M)) 
with k vertices if and only if fk(ti) = tk+i > ii + 1, i.e., we were able to turn around Q inside P in 
k + 1 steps (in fact, x(t\), x(ti), . . . , and x{tt) are the vertices of a feasible solution). 

Aggarwal et al. [1] also showed that the function is continuous, non-decreasing, and depends 
continuously on the vertices of Q (see also Appendix [AJ. Figure 2] displays the function f± for the 
nested squares (Example [2]) . 




Figure 4: Function fi{t) for Example [5] using the construction from [T] (see also Figure H] and Ap- 
pendix [A]). We only plot the function f± in the interval [0, |] because, by symmetry, f±(x + |) = 
/ 4 (x) + |. 
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If col(#(M)) n A m has three vertices, then a = 1. In fact, we have that 



6{V a {M)) C col(0(M))n A 



for any < a < 1 



implying rank + (7 :,a (M)) = 3 for all < a < 1. Moreover, because 6(T ,a (M)) has at least four 
vertices, col(0(M)) n A m is the unique solution of the corresponding NPP problem: the outer polygon 
is a triangle while the inner polygon has at least four vertices which are located on the edges of the 
outer triangle (since a = 1 and each column of V{M) contains at least one zero entry). 

Let us then assume that col(#(M)) n A m has at least four vertices. We show that this implies 
a < 1. Assume a = 1. The polygons P = col(0(M)) n A m and Q = 9(P(M)) have at least 4 
vertices. Moreover, the vertices of Q are located on the boundary of P (because a = 1) on at least two 
different sides of P (three vertices cannot be on the same side). It can be shown by inspection that 
the optimal solution of this NPP instance must have at least four vertices, hence rank + (7 :, (M)) > 3, 
a contradiction. 

Next, we show that fi(t) < t + 1. Assume there exists t such that fi(t) > t + 1. By continuity 
of /4 with respect to the vertices of Q = conv(6{V a (M))), there exists e > sufficiently small 
such that a + e < 1 and such that the function for the NPP instance with inner polygon Q' = 
coiYv(8(T ,a+€ (M))) and the same outer polygon P satisfies fi(t) > t + 1 hence rank + (T ,a+e (M)) < 3, 
a contradiction. 

In Appendix^] we prove that ff. is piecewise constant /strictly convex for any k, i.e., fk is made up 
of pieces which are either constant or strictly convex, with at most m + n break points corresponding 
to different solutions to the NPP. Therefore, because f$ is continuous and smaller than t + 1, it 
can intersect the line t + 1 only at the break points. Since there are at most m + n such points 
corresponding to different NPP solutions, the number of solutions of NPP("P a (M)) with three vertices 
is smaller than m + n. (Notice that the bound is tight since it is achieved by the nested squares 
example with 8 solutions.) □ 

Remark 3. If conv (9 (V(M))) has three vertices, they define a feasible solution for the corresponding 
NPP problem (i.e., V(M) is separable, see Theorem^). However, the number of solutions might be 
not be finite in that case. Here is an example 



whose corresponding NPP problems are represented on Figured the NPP of'V(M) does not have a 
finite number of solutions. 



M = 



I 0.5 0.25 \ 

1 0.5 0.75 1 

1 0.1 0.5 

V 1 0.9 0.5 / 



and V{M) 



I 0.5 \ 

1 0.5 0.3 0.5 

10 

\ 1 0.3 0.5 / 



0.2 



0.8 



0.6 



0.4 












0.2 



04 



0.6 



0.8 



Figure 5: Counter-example for Theorem [9] when V(M) has three vertices. 
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The fact that the NPP of the matrix V a (M) can have several different solutions is untypical and, 
we believe, could be due to the symmetry of the problem (as in Example [5]). We conjecture that, in 
general, the solution to NPP("P a (M)) is unique. In particular, we observed on randomly generated 
matrices that it was, see, e.g., Example[TJ In fact, as the function fk(-) defined in Theorem [U] depends 
continuously on the inner and outer polytopes Q and P, if these polytopes are generated randomly, 
there is no reason for the values of the function /&(.) at the break points to be located on the same 
line as on Figure HI 

We also conjecture that Theorem [9] holds true for any rank: 

Conjecture 1. Let M be such that rank(M) = rank + (M) = k and conv(9(V(M))) has at least (k + l) 
vertices, and a be defined as in Equation (|4.3p . then the number of solutions of NPP(P a {M)) is finite. 

Unfortunately, the geometric construction of Aggarwal et al. [1] cannot be generalized to three 
dimensions (or higher). To prove the conjecture, we would need to show that 

o Any solution of NPP(T ,a (M)) is isolated. Intuitively, the preprocessing V a {M) of M grows 
the inner polytope Q as long as the corresponding NPP instance has a solution with rank + (M) 
vertices. If a solution was not isolated, it could be moved around while remaining feasible, which 
indicates that we could grow the inner polytope Q hence increase a. 

o The number of isolated solutions is finite. We conjecture that the solutions can be characterized 
in terms of the faces of P and Q, which are finite (depending on m and n). 

Remark 4. Of course computing a is non-trivial. However, for matrices of small rank, this could be 
done effectively. In fact, checking whether the nonnegative rank of an m-by-n is equal to rank(M) can 
be done in polynomial time in m and n provided that the rank is fixed [2]. In particular, the algorithm 
of Aggarwal et al. fl^ does it in 0{{m + n) log(min(m, n))) operations for rank-three matrices [16]. 
Pence, one could for example use a bisection method to find a good lower bound (3 < a and use the 
corresponding matrix NPP(T , P{M)) to have a more well-posed NMF problem whose solutions will be 
solutions of the original one. 

5 Preprocessing in Practice 

In this section, we address three important practical considerations of the preprocessing. 
5.1 Computational Complexity of Solving (13.51) 

It is rather straightforward to check that problem (|3.5p can be decoupled into n independent CLLS's, 
each corresponding to a different column of M; for example, for the ith column of M, we have 

min \\M-i - Mb\\l such that Mi > Mb, h = 0. (5.1) 

We then have n CLLS's with n variables (actually n — 1 since variable b{ = can be removed) and 
m + n constraints. Using interior point methods, the computational complexity for solving (|5.ip is of 
the order of 0(n 3 ' 5 ); hence the total computational cost is of the order 0(n ). 

Figure [6] shows the computational time needed for solving (|3.5p with respect to m for n fixed 
and vice versa, for randomly generated matrices (using the rand(.) function of MATLAB®) on a 
laptop 3GHz Intel® CORE i7-2630QM CPU @2GHz 8Go RAM running MATLAB® R2011b using 
the function lsqlin(.) of MATLAB®. The computational time is linear in m while being of the order 
of n 3 in n, smaller than the expected 0(n 4 ' 5 ). Therefore, in practice, the dimension m can be rather 
large while, on a standard machine, n cannot be much larger than 1000. Using parallel architecture 
would allow to solve larger scale problems (see also Section [7|) . 
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Figure 6: Computational time for solving (|3.5p . On the left, m-by-100 randomly generated matrices; 
on the right, 1000-by-ra randomly generated matrices (plain) and the polynomial 2.610 -4 n 3 (dashed). 



5.2 Normalization of the Columns of V(M) 

Since the aim eventually is to provide a good approximate NMF to the original data matrix M, we 
observed that normalizing the columns of the preprocessed matrix V(M) to match the norm of the 
corresponding columns of M gives better results. That is, we replace V{M) with DV{M) where 



Dm = ,,Jj,, '^ 2 , , for all i, and D« = for all i ^ j. 
||P(M) :i || 2 ' 13 ^ J 

This scaling does not change the nice properties of the preprocessing since D is a monomial matrix, 
hence QD still is an inverse-positive matrix. This scaling degree of freedom is related to the fact that 
we fixed the diagonal entries of Q to one, see Section 13.21 

The reason for this choice is that NMF algorithms are sensitive to the norm of the columns of M. 
In fact, when using the Frobenius norm, we have that the following two problems are equivalent 



nun \\M-UV\\l = min V||M.i||2 

u>o,v>o x>o,y>o 
- - - i=i 



M. 



M. 



XY, 



i 2 



2 



2 



Therefore, to give each column of V(M) the same importance in the objective function as in the 
original NMF problem, it makes sense to use the scaling above. This is particularly critical if there 
are outliers in the dataset: the outliers do not look similar to the other columns of M hence their 
preprocessing will not reduce much their ^-norm (because they are further away from the convex cone 
generated by the other columns of M). Therefore, their relative importance in the objective function 
will increase in the NMF problem corresponding to V(M), which is not desirable. 

5.3 Dealing with Noisy Input Matrices and/or Obtaining Sparser Preprocessing 

Our technique will typically be useless when the input matrix is noisy and sparse. For example, we 
have 

\ / 00 \ [05 

M= | 1 j ,V(M)= I 1 I while M n = I 1 | = V{M n ) 
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for any 5 > 0. This shows that the preprocessing is very sensitive to small positive entries of M. In 
order to deal with such noisy and sparse matrices, we propose to relax the nonnegativity constraint 
M > MQ in (|3.5p . and solve instead 



mm 



2 

in V Mj-VMA such that M. A + eWM.^e > S^M^Bu, Vi, (5.2) 

nxra ^ — ^ || -1 — ^ 2 ^ — ^ 

i=l k^i kj^i 

where < e <C 1 and e is the vector of all ones of appropriate dimension. We will denote the 
corresponding preprocessing V t (M) = M(I — B*) where B* is an optimal solution of (|5.2p . For the 
example above with S = e = 10~ 2 , we obtain 

-10~ 2 1(T 2 
V e (M n ) = | 1 -10- 2 
10~ 4 0.99 

In practice, this technique also allows to obtain preprocessed matrices with more entries close (or 
smaller) than zero. When choosing the parameter e, it is very important to check whether p(B*) < 1 
so that the rank of "P e (M) is equal to the rank of M and no information is lost (i.e., we can recover 
the original matrix M = V e (M)(I - B* e )~ l given V e {M) and B*). 



6 Application to Image Processing 

In this section, we apply the preprocessing technique to several image datasets. By construction, the 
preprocessing procedure will remove from each image a linear combination of the other images. As we 
will see, this will highlight certain localized parts of these images, essentially because the preprocessed 
matrices are sparser than the original ones. We will then show that combining the preprocessing 
with standard NMF algorithms naturally leads to better part-based decompositions, because sparser 
matrices lead to sparser NMF solutions, see Section [2j 

A direct comparison between NMF applied on the original matrix and NMF applied on the pre- 
processed matrix is not very informative in itself: while the former will feature a lower approximation 
error, the latter will provide a sparser part-based representation. This does not really tell us whether 
the improvements in the part-based representation and sparsity are worth the increase in approxi- 
mation error. For that reason, we chose to compare them with a standard sparse NMF technique, 
described below, in order to better assess whether the increase in sparsity achieved is worth the loss 
in reconstruction accuracy. Hence, we compare the following three different approaches: 

o Nonnegative matrix factorization (NMF). It solves the original NMF problem from Equa- 
tion (jl.ip using the accelerated HALS algorithm (A-HALS) from [18J (with parameters a = 0.5 
and e = 0.1 as suggested in [IE]), which is a block coordinate descent method. 

o Preprocessed NMF for different values of e. It first computes the preprocessed ma- 
trix V e (M) (cf. Section I5.3P , then solves the NMF problem for the rescaled preprocessed ma- 
trix Ve{M)D « UV' (cf. Section E2]) using A-HALS and finally returns (U,V) where V = 
argminj 4 - >0 | \M — UX\\ 2 F . This approach will be denoted Pre-NMF(e). (We will also indicate 
in brackets the error obtained when using V = V'Q -1 , which will be, by construction, always 
higher.) Notice that the preprocessed matrix may contain negative entries (when e > 0) which 
is handled by A-HALS. We do not set these entries to zero for two important reasons: (i) we 
want to preserve the column space of M, (ii) the negative entries of M lead to sparser NMF 
solutions. In particular, it was shown that if an entry of M, say at position is smaller than 

-||max(0,M)||jr then (UV)ij = for any optimal solution of NMF [15]. 
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o Sparse NMF. The most standard technique to obtain sparse solutions for NMF problems is to 
use a sparsity-inducing penalty term in the objective function. In particular, it is well-known 
that adding an /i-norm penalty term induces sparser solutions (see, e.g., [22]), and we therefore 
solve the following problem: 

r 

min \\M-UV\\ 2 F + Y MiMli, IMoo = 1 Vi, (sNMF) 
U,V>0 ^— ' 

1=1 

where ||x||i = ^ I •^i|> Halloo — maxj \xi\ and \%i are positive parameters controlling the sparsity 
of the columns of U. In order to solve sNMF, we also use A-HALS which can easily be adapted to 
handle this situation. The ^-norm constraints is not restrictive because of the degree of freedom 
in the scaling of the columns of U and the corresponding rows of V, while it prevents matrix U 
to converge to zero. The theoretical motivation is that the Zi-norm is the convex envelope of the 
Zo-norm (i.e., the largest convex function smaller than the /o-norm) in the ^oo-ball, see [31] and 
the references therein. 

In order to compare sparse NMF with Pre-NMF(e), the parameters /Uj 1 < i < r are tuned 
in order to match the sparsity obtained by Pre-NMF(e). The corresponding approach will be 
denoted sNMF(e). 

For each approach, we will keep the best solution obtained among the same ten random initializa- 
tions (using the rand(.) function of MATLAB®) and each run was allowed 1000 (outer) iterations of 
the A-HALS algorithm. We will use the relative error 

\\M- UV\\ F 
\\M\\f 

to asses the quality of an approximation. We will also display the error obtain by the truncated 
singular value decomposition (SVD) for the same factorization rank to serve as a comparison. For the 
sparsity, we use the proportion of non-zero entrie^l 

#zeros(c0 g for U G M mxr . 

mr 

Because the solution computed with Pre-NMF does not directly aim at minimizing the error 
\\M — UV\\ F , it is not completely fair to use this measure for comparison. In fact, it would be better 
to compare the quality of the sparsity patterns obtained by the different techniques. For this reason, 
we use the same post-processing procedure as described in [17] which benefits all algorithms: once 
a solution is computed by one of the algorithms, the zero entries of U are fixed and we minimize 
min[/>o,y>o \ \M — UV\\ F on the remaining (nonzero) entries (again, A-HALS can easily be adapted 
to handle this situation and we perform 100 additional steps on each solution), and report the new 
relative approximation error as "Improved". 

The code is available at https://sites.google.com/site/nicolasgillis, 



6.1 CBCL Dataset 

The CBCL face datasetff] is made of 2429 gray-level images of faces with 19 x 19 pixels (black is one 
and white is zero). We look for an approximation of rank r = 49 as in [25]. Because of the large 
number of images in the dataset, the preprocessing is rather slow. In fact, we have seen in Section \5. II 
that it is in 0(n 4 ' 5 ) where n is the number of images in the dataset (it would take about one week on 

5 The negative entries of the preprocessed matrix V e (M) for e > will be counted as zeros. 
6 Available at http : //cbcl .mit . edu/sof tware- dataset s/FaceData2 .html 
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a laptop). Therefore, we only keep every third image for a total of 810 images, which takes less than 
three hours for the preprocessing; about 10-15 seconds per image^. 

Table [1] reports the sparsity and the value of p(B*) for the preprocessed matrices with different 
values of the parameter e. As explained in Section the sparsity of V e (M) increases with e, and e 
was chosen to make sure that p(B*) < 1 implying rank(7-e(M)) = rank(M). 

Table 1: CBCL dataset: sparsity of the preprocessed matrices V t (M) = MQ and corresponding 
spectral radius of B* = I — Q. 





M 


T{M) 






s(.) 





0.001 


20.92 


38.03 







0.71 


0.83 


0.90 



Figure displays a sample of images of the CBCL dataset along with the corresponding prepro- 
cessed images for different values of e. 




Figure 7: From top to bottom: CBCL sample images, corresponding preprocessed images for e = 0, 
e = 0.05, and e = 0.1. 

We observe that the preprocessing is able to highlight some parts of the images: the eyes (faces 
5 and 9), the eyebrows (faces 3, 4, 8, 10, 11, 13 and 16), the mustache (faces 14 and 15), the glasses 
(faces 6, 7 and 12), the nose (faces 1 to 4) or the mouth (faces 1 to 5). Recall that the preprocessing 
removes from each image of the original dataset a linear combination of other images. Therefore, the 
parts of the images which are significantly different from the other images are better preserved, hence 
highlighted. 

We now compare the three approaches described in the introduction of this section. Table [2] gives 
the numerical results and shows that Pre-NMF performs competitively with sNMF in all cases (similar 
relative error for similar sparsity levels). 

Figure[5]displays the basis elements obtained for NMF, Pre-NMF(O), Pre-NMF(O.l) and sNMF(O.l). 
The decomposition by parts obtained by Pre-NMF(O.l) is comparable to sNMF(O.l), reinforcing the 
observation above (cf. Table [2]) that Pre-NMF performs competitively with sNMF. 

Our technique has the advantage that only one parameter has to be chosen (namely e) and that 
sparse solutions are naturally obtained. In fact, the user does not need to know in advance the 
desired sparsity level: one just has to try different values of e G [0, 1] (making sure p(B*) < 1) and a 

7 The MATLAB® function Isqlin for solving CLLS problems is much slower than quadprog with interior point (which 
is much faster than quadprog with active set). 
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Table 2: Comparison of the relative approximation error and sparsity for the CBCL image dataset. 





Plain 


Improved 


s(U) 


s(V) 


SVD 


7.28 


7.28 








NMF 


7.97 


7.96 


53.27 


11.36 


Pre-NMF(O) 


9.28 (9.76) 1 " 


8.37 


76.78 


4.42 


sNMF(O) 


8.34 


8.20 


77.62 


5.19 


Pre-NMF(0.05) 


11.12 (12.66) 


9.15 


90.14 


2.16 


sNMF(0.05) 


9.24 


8.90 


91.12 


2.22 


Pre-NMF(O.l) 


13.12 (23.47) 


9.88 


94.58 


1.17 


sNMF(O.l) 


10.30 


9.89 


94.77 


1.14 



In brackets, it is the error obtained when using V = V'Q 1 , instead of V = argmin x> Q IIM - UX\\l. 



sparse factor U will automatically be generated (no parameters have to be tuned in the course of the 
optimization process). Moreover, Pre-NMF proves to be less sensitive to initialization than sNMF: we 
rerun both algorithms for e = 0.1 with 100 different initializations (using exactly the same settings as 
above) and observe the following: 

o Among the hundred solutions generated by sNMF(O.Ol), three did not achieve the required 
sparsity (being lower than 0.85, while all others were around 0.95 as imposed). In particular, 
the variance of the sparsity of the factor U for PreNMF(O.Ol) is 8.69 10~ 7 while it is much 
higher 1.31 10 -3 for sNMF(O.Ol). (Note that after removing the three outliers, the variance of 
sNMF(O.Ol) is still higher being 3.23 10" 6 .) 

o The average of the relative error of Pre-NMF(O.Ol) is 9.94, slightly lower than sNMF(O.Ol) with 
9.96. 

o The variance of the relative error of Pre-NMF(O.Ol) is 5.97 10~ 3 , lower than sNMF(O.Ol) with 
2.36 10~ 2 . 

Remark 5. We have also tested other sparse NMF techniques and they could not match the results 
obtained by sNMF, especially for high sparsity requirement. In particular, we tested the following 
standard formulation using only one penalty parameter (see, e.g., [22]) 

mnJlM-C/yill + zi^ll^ll!, 

i 

and the algorithm of Hoyer [20]. 

6.2 Hubble Telescope Hyperspectral Image 

The Hubble dataset consists of 100 spectral images of the Hubble telescope, 128 x 128 pixels each [28j. 
It is composed of eight materials^]; see the fourth row on Figure [10l The preprocessing took about 
one minute (about 0.5 second per imagejE Figure [9] displays a sample of images of the simulated 
Hubble database along with the corresponding preprocessed images. The preprocessing for e = 0.01 
highlights extremely well the constitutive parts of the Hubble telescope: it is in fact able to extract 
some materials individually. Table Ogives the sparsity and the value of p for the different preprocessed 
matrices. 

8 These are true Hubble satellite material spectral signatures provided by the NASA Johnson Space Center. 
9 The MATLAB® function Isqlin for solving CLLS problems was again much slower (about ten times) than quadprog 
with active set or with interior point which were comparable in this case. 
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Figure 8: From left to right, top to bottom: basis elements for the CBCL dataset obtained with NMF, 
Pre-NMF(O), Pre-NMF(O.l) and sNMF(O.l). 

Table 3: Hubble dataset: sparsity of the preprocessed matrices V t {M) = MD and corresponding 
spectral radius of B* = I — D. 





M 


V(M) 




s(.) 

p(B* e ) 


57 



57 
0.9808 


80 
0.9979 



Table H] reports the numerical results. Although sNMF(O.Ol) identifies a solution with slightly 
lower reconstruction error than Pre-NMF(O.Ol) (2.90 vs. 2.93), it is not able to identify the constitutive 
materials properly while Pre-NMF(O.Ol) perfectly separates all eight constitutive materials. It is also 
important to point out that the solutions generated by Pre-NMF(O.Ol) with different initializations 
correspond in most casea^l to this optimal decomposition while the solutions generated by sNMF are 
typically very different (and with very different objective function values). This indicates that the 

10 We used 100 random initializations and obtained 61 times the optimal decomposition (in the other cases, it is always 
able to detect at least six of the eight materials). 



26 





\* 




\ 


V 


\ 




v 


V 






V 




\4 










% * 




-;.J ^ 


- •; 


----- 

>* 




■ 

> 


\ \ 


4 
















-<\ 


• 



Figure 9: From top to bottom: Sample of Hubble images, corresponding preprocessed images for e = 
and e = 0.01. 



NMF problem corresponding to the preprocessed matrix is more well posed I. 

Table 4: Comparison of the relative approximation error and sparsity for the Hubble dataset. 





Plain 


Improved 


s(U) 


s(V) 


SVD 


0.01 


0.01 


58 





NMF 


0.06 


0.05 


58.02 


2.25 


Pre-NMF(O) 


0.08 (0.08) 


0.07 


59.16 


0.13 


sNMF(O) 


0.37 


0.36 


64.14 


0.63 


Pre-NMF(O.Ol) 


14.08 (75.09) T 


2.93 


93.71 





sNMF(O.Ol) 


3.39 


2.90 


93.94 






^Notice that for e = 0.01, the solution obtained using V = V'Q 1 has a very high reconstruction error; the reason 
being that Q = (J — B*) is close to being singular since p(B*) = 0.9979. 

The comparison between sNMF(0) and Pre-NMF(O) is also interesting: the basis elements gen- 
erated by Pre-NMF(O) (see second row of Figure [T0|) identify the constitutive materials much more 
effectively as six of them are almost perfectly extracted, while sNMF(0) only identifies one (while 
another is extracted as two separate basis elements). 

7 Conclusion and Further Research 

In this paper, we introduced a completely new approach to make NMF problems more well posed and 
have sparser solutions. It is based on the preprocessing of the nonnegative data matrix M: given M, 
we compute an inverse positive matrix Q such that the preprocessed matrix V(M) = MQ remains 
nonnegative and is sparse. The computation of Q relies on the resolution of constrained linear least 
squares problems (CLLS). We proved that the preprocessing is well-defined, invariant to permutation 
and scaling of the columns of matrix M, and preserves the rank of M (as long as the vertices of 
conv(#(M)) are non repeated). 

Because V{M) is sparser than M, the corresponding NMF problem will be more well posed and 
have sparser solutions. In particular, we were able to show that 

o Under the separability assumption of Donoho and Stodden [12], the preprocessing is optimal as 
it identifies the vertices of the convex hull of the columns of M. 

n Of course, in general, even if an NMF formulation has a unique global minimum (up to permutation and scaling), 
it will still have many local minima. Therefore, even in that situation, solutions generated with standard nonlinear 
optimization algorithms might still be rather different for different initializations. 
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Figure 10: From top to bottom: basis elements for the Hubble dataset obtained by NMF, Pre-NMF(O), 
sNMF(O), Pre-NMF(O.Ol) and sNMF(O.Ol). 



o Since any rank-two matrix satisfies the separability assumption, the preprocessing is optimal for 
any rank-two matrix. 

o In the exact rank-three case (i.e., M = UV, rank(M) = rank + (M) = 3), the preprocessing can 
be used to make the set of optimal solutions of the NMF problem finite. We conjecture that, 
generically, it makes it unique and that this result holds for higher rank matrices. 

We also proposed a more general preprocessing that relaxes the constraint that V(M) has to be 
nonnegative, which is able to deal better with noisy and sparse matrices. Moreover, it generates sparser 
preprocessed matrices hence sparser NMF solutions. We experimentally showed the effectiveness of 
this strategy on facial and hyperspectral image datasets. In particular, it performed competitively 
as a state-of-the-art sparse NMF technique based on £i-norm penalty functions. It is robust to high 
sparsity requirement and no parameters have to be tuned in the course of the optimization process. 
Only one parameter has to be chosen which will allow the user to generate more or less sparse pre- 
processed matrices. 

The main drawback of the technique seems to be its computational cost: n CLLS problems in n 
variables and m + n constraints have to be solved (where n in the number of columns of M) for a 
total computational cost of the order of 0(n 4 ' 5 ) (using MATLAB® on a standard laptop, it limits n 
to be smaller than 1000 for a few hours of computation). It would then be particularly interesting 
to investigate strategies to speed up the preprocessing. Using faster solvers is one possible approach 
(probably in detriment of the accuracy), e.g., based on first-order methodJ^l. Another possibility 
would be to use the following heuristic: since the preprocessing removes from each column of M 

12 We are currently developing an ADM algorithm, along with Ting Kei Pong, which allowed us to preprocess the 
CBCL dataset in about 10 hours with 10 -3 relative accuracy; the code is available upon request and should be available 
soon. 
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a linear combinations of the other columns, one could use only a subset of k columns of M to be 
subtracted from the other columns of M. This amounts to fixing variables to zero in the CLLS 
problems and would reduce the computational complexity to 0(nk 3 ' 5 ). This subset of columns could 
for example be selected such that its convex hull has a large volume, see, e.g., [23] for a possible 
heuristic; or such that they form the best possible basis for the remaining columns (i.e., use a column 
subset selection algorithm); see [5] and the references therein. 

Finally, a particularly challenging direction for research would be to design other data preprocessing 
techniques for NMF. One approach would be to characterizing the set of inverse positive matrices 
better: in this paper, we only worked with the subset of invertible M-matrices. For example, the 
matri^^l 

/Oil 
M=\ 1 1 
\ 1 1 

would not be modified by our preprocessing (because each column contains a zero entry corresponding 
to positive ones in all other columns) although its NMF is not unique (cf. Section [2]). In fact, we have 
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where Q is inverse positive with Q~ x = \M, and the NMF of MQ is unique. This examples shows 
that working with a larger set of inverse positive matrices would allow to obtain sparser preprocessed 
data matrices, hence more well-posed NMF problems with sparser solutions. 
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A Proof for Theorem [9] 



In this section, we prove that the function fk defined in Theorem [S] is continuous and made up of 
pieces which are either constant or strictly convex (which we refer to as piecewise constant/strictly 
convex). The construction described below is the same as the one proposed by Aggarwal et al. [1] and 
we refer the reader to that paper for more details. The novelty of our proof is to use that construction 
to show that fk is piecewise constant/strictly convex (it was already shown to be continuous and 
nondecreasing in [1]). 

Proof. Let x(t\) be on the boundary of P and define the sequence xfo), ■ ■ ■ , x(tk+i) as in Theorem^] 
(clock-wise). As shown by Aggarwal et al. pQ, the function fk(ti) = ifc+i only depends on 

1. The sides of P on which the points x{tj) 1 < i < k + 1 lie ; 

2. The intersections of the segments [x(ti), x(ti + \)] 1 < i < k with Q ; 

and, given that these sides and intersections do not change, fk is continuously differentiable and can 
be characterized in closed form (see below). These sides and intersections will change when either 

o One of the points x(ti) switches from one side of the boundary of P to another. These points 
correspond to the vertices of P (P has at most m vertices since it is a polygon defined with m 
inequalities); or, 

o One of the intersections of the segments [x(ti), x{ti + i)\ 1 < i < k with Q changes. There is a 
one-to-one correspondence between these points and the sides of Q (Q has at most n vertices 
hence at most n sides). 

These points where the description of fk changes (and where fk is not continuously differentiable) 
are called the contact change points. Turning around the boundary of P, we might encounter more 
than m + n such points. However, two contact change points corresponding to the same change are 
associated with the same sequence x(ti) 1 < i < k+1 hence the same solution to the NPP. In fact, both 
sequences must share at least one point (either a vertex of P or the intersections of a line containing a 
side of Q with the boundary of P) which implies, by construction, that they are the same. Therefore, 
there are at most m + n contact change points corresponding to different sequences x(ti) 1 < i < k + 1 
on the boundary of P pQ. 

It remains to show that the pieces of fk between two contact change points are either constant or 
strictly convex. 

Let us then construct the function fk between two contact change points. Without loss of generality, 
we may assume that the perimeter of the outer polygon P is equal to one (otherwise scale the polygons 
P and Q accordingly), and that the parametrization x of the boundary of P has the following property: 
the distance traveled when following the boundary between x(s) and x(t) is equal to | (s— [s\ ) — (t— [t\)\. 
In particular, if < s < t < 1, then the distance traveled between x{t) and x(s) along the boundary 
of P is t — s. We may also assume without loss of generality that x(0) = (0,0) is the vertex on P 
preceding x(t\) and that x(ti) = (0, ti): this amounts to translating and rotating P and Q. We also 
define (see Figure [11] for an illustration) 

° Q = Q2), the tangent point on Q between x(t±) and xfo)- 

o 9, the angle between the sides of P on which x{t\) and xfo) are. 

o p, the intersection between the sides on which x(t\) and xfo) are (note that p is on the boundary 
of P if and only if there is one and only one vertex of P between x(ti) and ^(£2))- 

o d, the distance between x(0) and p. 
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o s, the distance between p and xfa). 

o a, the projection of q on the line [x(0),p]. 

o b, the projection of xfo) on the line [x(0),p]. 




d 



Figure 11: Construction of the function f\ between two contact change points (see Aggarwal et al. [U 
Figure 3] for a similar illustration). 

Case 1: The point q is on the same side as x(ti). This implies that xfo) = p for any t\ < q± and 
no other points of the sequence is changed since xfo) remains the same. Therefore, the function 
tk+i = fk{t\) is constant. (Notice that x(q\) is a contact change point since xfo) will switch side 
when ti = q\.) 



Case 2: The point q is on the same side as xfo). This implies that xfo) = q for any t\ < d. 
Therefore, the function t^+i = fk(ti) is constant. (Notice that the next contact change point will be 
the first vertex of P that x(t\) crosses.) 

Case 3: The point q is not on the same side as x(ti) or x{t2) (i.e., it is in the interior of P). Using 
the similarity between the triangles Ax(ti)aq and Ax(ti)bx(t2), we have that [U Equation (1)] 

<72 s sin(#) 

qi — t\ d— t\ + s cos(#) ' 

implying 

= __Q2 d-h = 

S sin(0)gi- 92 cot(0)-ti 91{1> ' 

Let us show that gi(t\) is strictly convex, i.e., g"(ti) > 0. Since q is not on the same side as x(t\) or 
xfo), we have q<i > and < 6 < tt implying gffg^ > 0. Hence it suffices to show that h(t\) = 
is strictly convex, where I = qi — q2cot(0). Since s > and d > t\, we must have I — t\ > 0. (Notice 
that x(l) is a contact change point. In fact, for t± = I, the segments [x(ti),q] and [p, xfo)] become 
parallel implying that the intersection of Q with the segment [x(t\), xfo)] will change.) 
We then have 

h ' {tl) = (T^F 
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Since h is a strictly increasing function of t\ pQ, h'(ti) > hence d > I and 



*»(*)- 2^ >0, 



so that gi(t) is strictly convex. Finally, we have 

h(h) = t 2 = ci + s = ci+ gi(h), 

where either 

o ci = and g\ is a constant (cases 1. and 2.). 

o ci is an appropriate constant and g\ is an increasing and strictly convex function (case 3.). 
By construction, the same relationship will apply between t 2 and £3 with 

/2(*i) =t3 = c 2 + g 2 (s) = c 2 + g 2 (gi (h)), 

where c 2 is an appropriate constant and g 2 is either constant, or strictly convex and increasing. After 
k + 1 steps, we have 

fk{h) = t k +i = c k + g k {s) = c k + {g k o g k ^ o • • • o gi ) fa), 

where c k is an appropriate constant and the functions gi are either constant, or strictly convex and 
increasing. If one of the functions gi 1 < i < k is constant, then f k is constant. Otherwise the function 
fk(ti) = c fc + (gk-i " " " °fi f i)(^i) is strictly convex since it is a constant plus the composition of strictly 
convex and increasing functions. (In fact, the composition of one-dimensional increasing and strictly 
convex functions is increasing and strictly convex.) □ 
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